cFIT (common Factor Space Integration & Transfer) is a tool for data integration and transfer for scRNAseq data. It is applied to data from multiple labs and experimental conditions, technologies and even species.
Our proposed method models the shared information between various data sets by a common factor space, while allowing for unique distortions and shift per batch. The model parameters are learned under and iterative non-negative matrix factorization (NMF) framework and then used for synchronized intefration from across-domain assays. In addtion, the model enables transferring via low-rank matrix from more informative data to allow for precise identification of data of lower quality.
First load the package
library(cFIT)
devtools::load_all('~/Dropbox/research/cFIT/cFIT/')
We illustrate the standard workflow of data integration as well as data transfer, using well analyzed pancreas islet cells dara sets, retrieved from SeuratData package (Stuart et al. (2019)).
library(SeuratData)
data("panc8")
cFIT integration is composed of the following procedure:
The panc8 is saved as Seurat object, we extract the raw expression and cell metadata.
# extract the raw counts and metadata for data sets from 5 technologies
data.list = split_dataset_by_batch(X=t(as.matrix(panc8@assays$RNA@counts)),
batch = panc8@meta.data$tech,
labels = panc8@meta.data$celltype,
metadata = panc8@meta.data,
dataset.name = 'panc:')
Select 2000 highly variable genes
# select highly variable genes
genes = select_genes(data.list$X.list, ngenes=2000, verbose=F)
Preprocess to normalize by library size and log transform
# data preprocessing
exprs.list = preprocess_for_integration(data.list$X.list, genes, scale.factor=10^4, scale=T, center=F)
Perform data integration with CFITIntegrate. The only parameter to adjust is r, which is the number of factors of the common factor matrix (shared across datasets). The algorithm is relatively robust to r as long as r approximate the number of underlying distinct cell types across all data sets. Other minor parameters that can be adjusted is max.niter (100 by default), tol tolerance used in stopping criteria (1e-5 by default), gamma tunning of the penalty term (1e6 by default), nrep number of repeats of the run (1 by default, require much longer if increased), init parameter initialization (NULL).
# takes roughly 20 minutes to run
int.out = CFITIntegrate(X.list=exprs.list, r=15, verbose=T, seed=0)
saveRDS(int.out, file='results/panc5_integration_output.rds') # /Users/minshipeng/Dropbox/research/cFIT/cFIT/vignettes/
The output is a list containing
W, factor loading matrices (per datasets) H.list, dataset specific scaling lambda.list and shift b.listconvergence, number of iterations niter, and final objective function value obj, final objective delta deltagamma, max.iter, tol, repWe obtain the integrated data via
# ncell-by-ngene expression matrix
exprs.int = do.call(rbind, int.out$H.list) %*% t(int.out$W)
# ncell-by-r low dimensiional representation
Hnorm = do.call(rbind, int.out$H.list) %*% diag(colSums(int.out$W))
Visualize the integrated data via UMAP plot,
tech = do.call(c, lapply(data.list$metadata.list, function(x) x$tech))
celltype = do.call(c, data.list$labels.list)
umap.out = plot_umap(X=Hnorm, labels=celltype,
pca = NULL, n_components = 2, n_neighbors = 50, min_dist = 0.1, # umap parameters
point.size = 0.6, alpha=0.8, title=NULL, legend.name='cell type', # figure parameters
seed=42)
p1 = umap.out$p # colored by technologies
p2 = plot_umap(labels=tech, point.size = 0.5, alpha=0.5, legend.name='technology', emb=umap.out$emb)$p # colored by cell types
p1
p2
Figure 1: Integrated pancreas islet cells data sests from five technologies
Calculate the alignment score for each datasets
alignment.score.out = calculate_alignment(X.list=list(Hnorm), k=30, balanced=F, dataset=tech, labels=celltype)
# alignment.score.out$alignment.per.dataset
# celseq celseq2 fluidigmc1 indrop smartseq2
# 0.8175777 0.7089732 0.8483672 0.6431162 0.7169563
We first integrate data sets from four technologies (inDrop, celseq, celseq2, SmartSeq2), and then transfer the learned factor matrix to the smallest dataset–fluidigmc1, which is from a very different technology from the other seven source data sets.
# perform integration to estimate Wref with data sets from 4 technologies
# run time 20 min
int.ref.out = CFITIntegrate(X.list=exprs.list[c(1,2,3,5)], r=15, verbose=T, seed=42)
saveRDS(int.ref.out, file='results/panc_ref_integration_output.rds')
The learned common factor matrix, saved in int.ref.out is then transfer to the smallest data set, containing 638 cells from fluidigmc1 technology.
# transfer: run time 10 seconds
tf.out = CFITTransfer(Xtarget=exprs.list[[4]], Wref=int.ref.out$W, seed=0, verbose=F)
Visualize the transferred results via UMAP plot,
Hnorm = rbind(do.call(rbind, int.ref.out$H.list), tf.out$H) %*% diag(colSums(int.ref.out$W))
source = rep(c('reference','target'), c(nrow(do.call(rbind, int.ref.out$H.list)), nrow(tf.out$H)))
celltype = do.call(c, c(data.list$labels.list[c(1,2,3,5)],data.list$labels.list[4]))
umap.out = plot_umap(X=Hnorm, labels=source, min_dist = 0.1, # umap parameters
point.size = 0.6, alpha=0.8, title=NULL, legend.name='source',
cols=c('grey80','red'), seed=0)
p1 = umap.out$p # colored by source
p2 = plot_umap(labels=celltype, point.size = 0.5, alpha=0.5, legend.name='cell type',
emb=umap.out$emb)$p # colored by cell types
p1+p2
Figure 2: cFIT transfer results
Assign labels for each cell in the target data by querying the cell type of target cells within k nearest neighbors.
est.labels = asign_labels(exprs.source=do.call(rbind, int.ref.out$H.list),
exprs.target=tf.out$H,
labels.source=do.call(c, data.list$labels.list[c(1,2,3,5)]))
plotContTable(est_label=est.labels, true_label=data.list$labels.list[[4]], ylab='Mapped type')
Figure 3: Confution matrix comparing the mapped cell type with the reference labels
We provide tools for simulating data sets. The data can be generated either from our proposed model, or from models provided by a widely employed scRNAseq data simulator Splatter package (Zappia, Phipson, and Oshlack (2017))
Here we show an example of simulating 5 datasets (ntask), each composed of 2000 samples (n), 500 features (p), from 8 underlying clusters (K). The cluster separation in controled by the parameter cl.seq, and the batch distinction is determined by batch.effect.sig, and the within cluster variance is controled by sig. alpha determines the group proportion, where a smaller alpha corresponds to more unbalanced datasets.
data.out = generate_data(n=2000, p=500, ntask=5, K=6, cl.sep=1, sig=1, batch.effect.sig=1, alpha=0.5)
where the ouput contains all the simulated expression matrix saved in a list X.list and all the model parameters W, H.list, lambda.list, b.list. We can see the the simulated data is grouped by both cluster and batch.
labels = as.character(do.call(c, data.out$label.list))
batchs = as.character(rep(1:length(data.out$X.list), each=length(data.out$label.list[[1]])))
umap.out = plot_umap(X=do.call(rbind, data.out$X.list), labels=labels,
min_dist = 0.4, point.size = 0.6, alpha = 0.5, legend.name = 'cluster')
p1 = umap.out$p
p2 = plot_umap(labels=batchs, point.size = 0.5, alpha=0.5, legend.name='batch', emb=umap.out$emb)$p
p1+p2
Figure 4: Simulated data from assumed model
By cFIT we get the integrated results
int.out = CFITIntegrate(X.list=data.out$X.list, r=6, verbose=T, tol=5e-5, seed=42)
saveRDS(int.out, file='results/simu_model_integration_output.rds')
Hnorm = do.call(rbind, int.out$H.list) %*% diag(colSums(int.out$W))
umap.out = plot_umap(X=Hnorm, labels=labels, point.size = 0.5, alpha=0.5, legend.name='cell type')
p1 = umap.out$p # colored by technologies
p2 = plot_umap(labels=batchs, point.size = 0.5, alpha=0.5, legend.name='batch', emb=umap.out$emb)$p # colored by cell types
p1+p2
Figure 5: Simulated data from assumed model with batch effects corrected applying cFIT
More details to add …
The computation complexity of cFIT scales linearly with the number of cells. It can accommodate tens of thousands of cells within reasonable run time on a standard PC (depending on the number of genes, p, and the number of factors r). To speed up the processing of millions of cells, we implemented a fast version of the algorithm employing the idea of random sketching and stochastic proximal point method (SPP). Details can be found in our manuscipts.
The speed up version is implemented by CFITIntegrate_sketched function. The usage is similar to CFITIntegrate with two additional parameters:
subsample.prop: the fraction to be subsampled in each parameter update, in range (0,1]. Default is set to min(5*10^4/n, 1). Smaller value will results in faster computation for each iteration, but too small the value would result in higher estimation error.weight.list: a list of vectors of weights for the corresonding samples saved as a list of data matrices X.list. If provided, cells will be subsampled based on the relative weights instead of uniformly.We examine the performance of the speed up with the following experiments:
# compare runing time with increasing sample per batch
# Note that this experiment will takes hours!
n.list=c(5000, 10000, 50000)
subsample.prop.list = c(1, 0.5, 0.1, 0.01)
ntask = 5 # five batches in total
p = 500 # number of genes
r = 10 # number of factors
runtime.mat = matrix(NA, length(n.list), length(subsample.prop.list))
obj.mat = matrix(NA, length(n.list), length(subsample.prop.list))
deltaw.mat = matrix(NA, length(n.list), length(subsample.prop.list))
info.list = list()
set.seed(42)
for (i in 1:length(n.list)){
n = n.list[i]
message('Run experiment with n=',n)
data.out = generate_data(n=n, p=p, ntask=ntask, K=r, cl.sep=1, sig=2, batch.effect.sig=0.5, alpha=1)
for (j in 1:length(subsample.prop.list)){
subsample.prop = subsample.prop.list[j]
message('subsample.prop=',subsample.prop,'...')
start.time = Sys.time()
int.out = CFITIntegrate_sketched(X.list=data.out$X.list, r=r, verbose=T, subsample.prop=subsample.prop,
tol=1e-5, seed=42, max.niter=50, n.cores = 4)
end.time = Sys.time()
time = difftime(end.time, start.time,units = 'm')
message('Finish, run ', int.out$niter,' iters in time:', time,'.')
runtime.mat[i, j] = time
obj.mat[i,j] = int.out$obj
deltaw.mat[i,j] = int.out$deltaw
info.list = c(info.list,list(list(n = n, subsample.prop=subsample.prop, deltaw.history=int.out$deltaw.history)))
}
}
saveRDS(list(runtime.mat=runtime.mat, obj.mat=obj.mat, deltaw.mat=deltaw.mat, info.list=info.list,
n.list=n.list, subsample.prop.list=subsample.prop.list, p=p, ntask=ntask, r=r), 'results/simu_sketch.rds')
Examine the results, including the run time, the objective functions
out = readRDS('results/simu_sketch.rds')
runtime.mat=out$runtime.mat
obj.mat=out$obj.mat
deltaw.mat=out$deltaw.mat
info.list=out$info.list
n.list=out$n.list
subsample.prop.list = out$subsample.prop.list
ntask = out$ntask
df = data.frame(runtime = c(runtime.mat),
obj = c(obj.mat),
deltaw = c(deltaw.mat),
n = ntask * rep(n.list, ncol(runtime.mat)),
subsample = rep(subsample.prop.list, each=nrow(runtime.mat)))
p1 = ggplot(data=df) + geom_point(aes(x=as.character(subsample), y=runtime, color=as.factor(n))) +
geom_line(aes(x=as.character(subsample), y=runtime, group=as.factor(n))) +
labs(x='subsample proportion', y='run time (min)') +
scale_y_continuous(trans = "log10")+
labs(color = "sample size") + theme_light()
p2 = ggplot(data=df) + geom_point(aes(x=as.character(subsample), y=obj, color=as.factor(n))) +
geom_line(aes(x=as.character(subsample), y=obj, group=as.factor(n))) +
labs(x='subsample proportion', y='objective function') +
scale_y_continuous(trans = "log10")
labs(color = "sample size") + theme_light()
p1+p2
Figure 6: Examine the run time (left) and objective function with sketched cFIT with varying subsampling fractions
Stuart, Tim, Andrew Butler, Paul Hoffman, Christoph Hafemeister, Efthymia Papalexi, William M Mauck III, Yuhan Hao, Marlon Stoeckius, Peter Smibert, and Rahul Satija. 2019. “Comprehensive Integration of Single-Cell Data.” Cell 177 (7). Elsevier: 1888–1902.
Zappia, Luke, Belinda Phipson, and Alicia Oshlack. 2017. “Splatter: Simulation of Single-Cell RNA Sequencing Data.” Genome Biology 18 (1). BioMed Central: 1–15.